In the following exercises, we will use the data you collected in the previous session (all comments for the video “The Census” by Last Week Tonight with John Oliver). Please note that your results might look slightly different from the output in the solutions for these exercises, as we collected the comments earlier.
First, we need to load the parsed comments data (NB: you might have to adjust the file path in the following code to match the location of the file on your computer).
comments <- readRDS("../data/ParsedComments.rds")
After loading the data, we go through the preprocessing steps described in the slides. In the first step, we remove newline characters from the comment strings (the version without emojis).
library(tidyverse)
comments <- comments %>%
  mutate(TextEmojiDeleted = str_replace_all(TextEmojiDeleted,
                                            pattern = "\\\n",
                                            replacement = " "))
Next, we tokenize the comments and create a document-feature matrix from which we remove English stopwords.
library(quanteda)
toks <- comments %>%
  pull(TextEmojiDeleted) %>%
  char_tolower() %>%
  tokens(remove_numbers = TRUE,
         remove_punct = TRUE,
         remove_separators = TRUE,
         remove_symbols = TRUE,
         split_hyphens = TRUE,
         remove_url = TRUE)
# dfm_remove() replaces the remove argument of dfm(),
# which was deprecated in quanteda 3
comments_dfm <- dfm(toks) %>%
  dfm_remove(pattern = quanteda::stopwords("english"))
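As a minimal sketch of this tokenize-then-filter pipeline, here are two made-up example texts (not from the comments data): after building the DFM, the English stopword "the" is removed.

```r
library(quanteda)

# Two short example "comments" (made up for illustration)
texts <- c("The census matters!", "Counting the people...")

# Lowercase and tokenize, dropping punctuation
toy_toks <- tokens(char_tolower(texts), remove_punct = TRUE)

# Build the DFM, then drop English stopwords ("the" disappears)
toy_dfm <- dfm_remove(dfm(toy_toks), pattern = stopwords("english"))

featnames(toy_dfm)
```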
Next, we want to identify the most frequent words in the comments and store them in an object called term_freq. We can use the function textstat_frequency() from the quanteda package to answer this question.
# NB: in quanteda >= 3, textstat_frequency() has moved
# to the quanteda.textstats package
term_freq <- textstat_frequency(comments_dfm)
head(term_freq, 20)
## feature frequency rank docfreq group
## 1 census 1820 1 1398 all
## 2 people 1011 2 739 all
## 3 just 763 3 659 all
## 4 like 620 4 524 all
## 5 one 525 5 441 all
## 6 can 500 6 437 all
## 7 trump 494 7 443 all
## 8 know 458 8 405 all
## 9 get 442 9 394 all
## 10 john 431 10 401 all
## 11 government 391 11 314 all
## 12 question 374 12 318 all
## 13 us 371 13 307 all
## 14 many 363 14 310 all
## 15 citizens 359 15 260 all
## 16 country 296 16 248 all
## 17 even 295 17 274 all
## 18 think 293 18 268 all
## 19 want 282 19 243 all
## 20 illegal 281 20 216 all
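The difference between the frequency and docfreq columns can be illustrated with a toy corpus (made up for illustration, not from the actual data): a word can occur many times overall while appearing in only a few documents.

```r
library(quanteda)
# textstat_frequency() moved to quanteda.textstats in quanteda >= 3
library(quanteda.textstats)

# Toy corpus: "census" occurs three times but in only two documents,
# while "people" occurs three times across all three documents
toy_dfm <- dfm(tokens(c("census census people",
                        "census people",
                        "people")))

toy_freq <- textstat_frequency(toy_dfm)
toy_freq
```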
To see in how many unique comments the terms appear, we can sort the results by the docfreq column of the term_freq object you created in the previous task.
term_freq %>%
  arrange(-docfreq) %>%
  head(10)
## feature frequency rank docfreq group
## 1 census 1820 1 1398 all
## 2 people 1011 2 739 all
## 3 just 763 3 659 all
## 4 like 620 4 524 all
## 7 trump 494 7 443 all
## 5 one 525 5 441 all
## 6 can 500 6 437 all
## 8 know 458 8 405 all
## 10 john 431 10 401 all
## 9 get 442 9 394 all
We also want to look at the emojis used in the comments on the video “The Census” by Last Week Tonight with John Oliver. Similar to what we did for the comment text without emojis, we first need to wrangle the data (remove missing values, tokenize the emojis, create a DFM).
emoji_toks <- comments %>%
  mutate(Emoji = na_if(Emoji, "NA")) %>%
  mutate(Emoji = str_trim(Emoji)) %>%
  filter(!is.na(Emoji)) %>%
  pull(Emoji) %>%
  tokens(what = "fastestword")

EmojiDfm <- dfm(emoji_toks)
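The cleaning steps before tokenization can be sketched on a toy Emoji column (values made up for illustration): the literal string "NA" becomes a real missing value, surrounding whitespace is trimmed, and missing rows are dropped.

```r
library(tidyverse)

# Toy data: the Emoji column stores missing values as the string "NA"
toy <- tibble(Emoji = c("emoji_fire ", "NA", " emoji_thumbsup"))

toy_clean <- toy %>%
  mutate(Emoji = na_if(Emoji, "NA")) %>%  # literal "NA" -> real NA
  mutate(Emoji = str_trim(Emoji)) %>%     # strip surrounding whitespace
  filter(!is.na(Emoji))                   # drop the missing rows

toy_clean$Emoji
```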
EmojiFreq <- textstat_frequency(EmojiDfm)
head(EmojiFreq, n = 10)
## feature frequency rank docfreq group
## 1 emoji_facewithtearsofjoy 103 1 60 all
## 2 emoji_rollingonthefloorlaughing 37 2 21 all
## 3 emoji_thinkingface 29 3 18 all
## 4 emoji_grinningfacewithsweat 15 4 13 all
## 5 emoji_registered 14 5 4 all
## 6 emoji_fire 12 6 3 all
## 7 emoji_grinningsquintingface 10 7 6 all
## 8 emoji_loudlycryingface 10 7 6 all
## 9 emoji_unamusedface 9 9 9 all
## 10 emoji_clappinghands 8 10 2 all
EmojiFreq %>%
arrange(-docfreq) %>%
head(10)
## feature frequency rank docfreq group
## 1 emoji_facewithtearsofjoy 103 1 60 all
## 2 emoji_rollingonthefloorlaughing 37 2 21 all
## 3 emoji_thinkingface 29 3 18 all
## 4 emoji_grinningfacewithsweat 15 4 13 all
## 9 emoji_unamusedface 9 9 9 all
## 12 emoji_facewithrollingeyes 7 11 7 all
## 13 emoji_thumbsup 7 11 7 all
## 7 emoji_grinningsquintingface 10 7 6 all
## 8 emoji_loudlycryingface 10 7 6 all
## 11 emoji_smilingfacewithsunglasses 7 11 6 all
To plot the most frequent emojis, we use a custom mapping function that adds the emoji images to the plot. You can have a look at the emoji_mapping_function.R file to see what this function does.
source("../Scripts/emoji_mapping_function.R")
create_emoji_mappings(EmojiFreq, 10)
EmojiFreq %>%
  head(n = 10) %>%
  ggplot(aes(x = reorder(feature, -frequency), y = frequency)) +
  geom_bar(stat = "identity",
           color = "black",
           fill = "#FF74A6",
           alpha = 0.7) +
  geom_point() +
  labs(title = "Most frequent emojis in comments",
       subtitle = "The Census: Last Week Tonight with John Oliver (HBO)
       \nhttps://www.youtube.com/watch?v=1aheRpmurAo",
       x = "",
       y = "Frequency") +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 150)) +
  theme(panel.grid.major.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  mapping1 +
  mapping2 +
  mapping3 +
  mapping4 +
  mapping5 +
  mapping6 +
  mapping7 +
  mapping8 +
  mapping9 +
  mapping10
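Without the emoji mapping helper, the core bar-chart pattern can be sketched with toy frequency data (values made up, not the actual emoji counts): reorder() sorts the bars by descending frequency.

```r
library(tidyverse)

# Toy frequency table standing in for EmojiFreq (values made up)
toy_freq <- tibble(feature = c("joy", "lol", "think"),
                   frequency = c(103, 37, 29))

p <- ggplot(toy_freq,
            aes(x = reorder(feature, -frequency), y = frequency)) +
  geom_bar(stat = "identity", color = "black",
           fill = "#FF74A6", alpha = 0.7) +
  labs(x = "", y = "Frequency")

# p is a ggplot object; print(p) draws the ordered bar chart
```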